Efficient ECC-Based Directory Implementations for Scalable Multiprocessors
Abstract
With increasing chip densities, next-generation microprocessor designs have the opportunity to integrate many of the traditional system-level modules onto the same chip as the processor. This integration changes some of the design trade-offs for how and where to store directory information. One extremely attractive option is to support directory data with virtually no memory space overhead by computing memory ECC at a coarser granularity and using the freed bits to store the directory information. Compared to providing a dedicated memory and datapath for directory storage, this approach leads to lower cost and a simpler design by requiring fewer components and pins. Furthermore, this approach leverages the low-latency, high-bandwidth path to memory provided by the integration of memory controllers onto the processor chip. However, without careful design, keeping data and directory bits together can lead to inefficiencies in the form of extra memory bandwidth usage, extra memory controller occupancy, and extra memory latency. This paper describes the techniques used in the Piranha design [3] to provide an efficient ECC-based directory implementation that addresses the occupancy/bandwidth and latency issues. Our approach to the occupancy/bandwidth issues involves either eliminating the extra read and write operations or performing partial memory accesses (instead of accessing the whole block). This is achieved by a combination of techniques, which include (i) augmenting the L2 caching state to keep track of some critical directory state, (ii) making up dummy data for protocol transactions in which the memory copy is stale, and (iii) maintaining a partial ECC that is used to compute the combined ECC of the data and the modified directory bits without needing the actual data bits.
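The partial-ECC idea in (iii) can be illustrated with a toy example. The abstract does not say which code Piranha uses, so the sketch below assumes a simple linear (Hamming-style) code in which the check value is the XOR of the positions of all set bits. The names `hamming_checks`, `partial_ecc`, and `combined_ecc`, and the 16-bit data and 4-bit directory widths, are hypothetical and chosen only for illustration. Because the code is linear, the contribution of the data bits can be stored once and later combined with the contribution of new directory bits without rereading the data.

```python
DATA_BITS = 16  # assumed toy width, not taken from the paper
DIR_BITS = 4    # assumed toy width, not taken from the paper


def hamming_checks(word):
    """Toy linear code: XOR together the 1-based positions of all set bits."""
    checks = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            checks ^= position
    return checks


def partial_ecc(data):
    """Check value of the data bits alone, with the directory field zeroed."""
    return hamming_checks(data + [0] * DIR_BITS)


def combined_ecc(partial, directory):
    """Combine a stored partial ECC with new directory bits.

    Linearity gives checks(data || dir) = checks(data || 0) XOR checks(0 || dir),
    so the data bits themselves are not needed here.
    """
    return partial ^ hamming_checks([0] * DATA_BITS + directory)


data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
stored_partial = partial_ecc(data)

# Update the directory field without touching the data bits.
new_directory = [1, 0, 0, 1]
updated = combined_ecc(stored_partial, new_directory)
print(updated == hamming_checks(data + new_directory))
```

A real memory ECC would add double-error detection and wider fields, but the same linearity argument applies to any code whose check bits are XOR combinations of the protected bits.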
To address the latency issues, we replicate critical directory state in different segments of the memory line, which allows us to efficiently support the critical-word-first optimization by pipelining data from memory to the requester before all the data is read from memory. The combination of the above techniques also eliminates all the inefficiencies that arise from maintaining a combined ECC for directory and data bits. Therefore, we benefit from the more efficient use of bits provided by the combined ECC with virtually no performance penalty compared to maintaining separate ECC bits for data and directory. Finally, the optimizations used in Piranha are general and applicable to other designs that use ECC-based directories.
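To see why a coarser ECC granularity frees bits, the arithmetic can be worked through for a standard SECDED (single-error-correct, double-error-detect) Hamming code. The abstract gives no concrete sizes, so the 512-bit memory line and the 64-bit and 256-bit granularities below are assumptions for illustration, and the function names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for a SECDED Hamming code over `data_bits` bits.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one more overall parity bit adds double-error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def freed_bits(line_bits, fine, coarse):
    """ECC bits saved per line when moving from `fine` to `coarse` granularity."""
    fine_total = (line_bits // fine) * secded_check_bits(fine)
    coarse_total = (line_bits // coarse) * secded_check_bits(coarse)
    return fine_total - coarse_total


# Assumed example: 64-byte (512-bit) line, ECC over 64-bit vs 256-bit words.
print(secded_check_bits(64))     # 8 check bits per 64-bit word
print(secded_check_bits(256))    # 10 check bits per 256-bit word
print(freed_bits(512, 64, 256))  # 8*8 - 2*10 = 44 bits available per line
```

Under these assumed sizes, the same ECC storage that protects a line at 64-bit granularity leaves 44 bits per line for directory state once the code is computed over 256-bit words.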
Related resources
Characterization of a List-Based Directory Cache Coherence Protocol for Manycore CMPs
The development of efficient and scalable cache coherence protocols is a key aspect in the design of manycore chip multiprocessors. In this work, we review a class of cache coherence protocols that, despite having already been implemented in the 90s for building large-scale commodity multiprocessors, have not been seriously considered in the current context of chip multiprocessors. In particular...
A Novel Lightweight Directory Architecture for Scalable Shared-Memory Multiprocessors
There are two important hurdles that restrict the scalability of directory-based shared-memory multiprocessors: the directory memory overhead and the long L2 miss latencies due to the indirection introduced by accesses to directory information, which is usually stored in main memory. This work presents a lightweight directory architecture aimed at addressing these two important problems. Our proposal ta...
Efficient elliptic curve cryptosystems
Elliptic curve cryptosystems (ECC) are a new generation of public key cryptosystems that have a smaller key size for the same level of security. Exponentiation on an elliptic curve is the most important operation in ECC, so when ECC is put into practice, the major problem is how to speed up the exponentiation. It is thus of great interest to develop algorithms for exponentiation...
Efficient Implementation of Cache Coherence in Scalable Shared Memory Multiprocessors
The cache coherence scheme for a scalable distributed shared memory multiprocessor should be efficient in terms of memory overhead for maintaining the directories, as well as network latency for a memory request. In this paper, we propose a cache coherence scheme which minimizes the memory access delay and at the same time, reduces the directory overhead by using a limited directory scheme. In t...
A Versatile Directory Scheme (Dir2NB+L) and Its Implementation on BY91-1 Multiprocessors System
Cache coherence and synchronization between processors have been two critical issues in designing a shared memory multiprocessor system. From the perspective of hardware design, a directory-based cache coherence protocol and a lock mechanism are employed to prevent inconsistency of caches and guarantee atomic memory accesses. The BY91-1 multiprocessors efficiently integrate supports for cache coher...